Endometrial cancer in women with abnormal uterine bleeding: Data mining classification methods

Background: Over the last decade, the use of artificial intelligence in medicine has been growing. Since endometrial cancer can be treated when diagnosed early, a non-invasive method for screening patients, especially high-risk ones, could be of particular value. Given the importance of this issue, we aimed to investigate the risk factors related to endometrial cancer and to find a tool to predict it using machine learning. Methods: In this cross-sectional study, 972 patients with abnormal uterine bleeding from January 2016 to January 2021 were studied, and the essential characteristics of each patient, along with the findings of curettage pathology, were analyzed using statistical methods and machine learning algorithms, including artificial neural networks, classification and regression trees, support vector machine, and logistic regression. Results: Out of 972 patients with a mean age of 45.77 ± 10.70 years, 920 patients had benign pathology, and 52 patients had endometrial cancer. In terms of endometrial cancer prediction, the logistic regression model had the best performance (sensitivity of 100% and 98% and specificity of 98.83% and 98.7% for the training and test data sets, respectively), followed by the classification and regression trees model. Conclusion: Based on the results, artificial intelligence-based algorithms can be applied as a non-invasive screening method for predicting endometrial cancer.

Endometrial cancer is one of the most common cancers among women; in the United States it is the fourth most common cancer after breast, lung, and colorectal cancers (1). Its risk factors include high levels of estrogen, high blood pressure, early menarche, late menopause, tamoxifen use, nulliparity, Lynch syndrome, and older age (≥55 years) (1)(2)(3)(4)(5)(6). According to the American Cancer Society, all women over the age of 65 should be aware of the risk factors and symptoms of endometrial cancer so that they can be referred for further evaluation if any symptoms occur. Abnormal uterine bleeding, especially after menopause, is one of the most common manifestations of endometrial cancer and should be evaluated (1). The incidence of endometrial cancer is rising because of increased life expectancy and the prevalence of obesity (2). Evidence suggests that endometrial adenocarcinoma is more likely to be treated successfully than other female cancers, because early signs such as abnormal vaginal bleeding prompt patients to see a specialist sooner and receive treatment in the early stages of the disease (7,8). Although abnormal uterine bleeding is the most common symptom, it is not a good indicator of endometrial cancer: only about 10% of women who present with this symptom have cancer, so 90% of women undergoing invasive diagnostic procedures do not have cancer (9). Numerous studies have sought endometrial cancer screening tools for use at the primary care level to determine which patients should be evaluated further (9). Artificial intelligence is a powerful mathematical tool that can contribute substantially to public health (9). Machine learning is a branch of artificial intelligence in which a computer system learns from existing data and helps identify complex patterns (10). It is a scientific discipline that focuses on how computers learn from data (11).
Conventional biostatistical methods are not well suited to managing complex data (10,12). The benefits of machine learning over traditional statistical methods include flexibility and scalability (12). Machine learning algorithms can analyze large quantities of varied and complex data and are used for disease risk prediction, classification, prognosis, diagnosis, and treatment selection (12). In recent years, advanced classification techniques such as artificial neural networks, classification and regression trees, support vector machine, and logistic regression have been widely used for the prediction of many diseases, including cancers (13)(14)(15)(16)(17)(18)(19)(20)(21). Because endometrial cancer can impose an economic burden on the health system, in this study we used statistical approaches to investigate the importance of factors associated with endometrial cancer in women with abnormal uterine bleeding. The present study then used these features to build predictive models that estimate a function mapping input (features) to output (cancerous or non-cancerous), with the goal of identifying the best classification model as a non-invasive predictive and screening tool for endometrial cancer.

Study design:
In this cross-sectional study, 972 patients with abnormal uterine bleeding who were referred to the gynecology clinic of Imam Hossein Hospital in Tehran, Iran, from 2016 to 2021 were examined. Patients were enrolled after ethical approval was obtained from the ethics committee of the Vice-Chancellery of Research at Shahid Beheshti University of Medical Sciences (code: IR.SBMU.RETECH.REC.1400.461) and after they gave written informed consent. The inclusion criteria were women of reproductive or menopausal age with abnormal uterine bleeding who underwent physical examination, radiological and laboratory assessment, and endometrial sampling through dilatation and curettage. Pathological results were reported as benign for 920 patients and malignant for 52 patients. Based on these data, we evaluated the risk factors associated with endometrial cancer and compared the predictive accuracy of artificial neural networks with that of traditional machine learning models, namely logistic regression, classification and regression trees, and support vector machine. The following features were recorded for each case within the existing database: age, body mass index, type of abnormal uterine bleeding, size of the uterus on physical exam, history of other diseases, history of pregnancy, menarche age, menopausal age, menopausal status, family history of cancer, tamoxifen use, and endometrial thickness and uterine size on ultrasound. The present study used these features to build predictive models that estimate a function mapping input (features) to output (cancerous or non-cancerous).

Statistical analysis:
Quantitative data were presented as mean and standard deviation or median and interquartile range, and qualitative data were presented as frequency and percentage. The chi-square test, independent samples t-test, and Mann-Whitney test were applied for univariate analysis; all variables with p<0.05 and a frequency greater than 10% were carried forward into the classification models. All statistical analyses were performed using a 0.05 significance level.
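The univariate screening step for a binary characteristic can be illustrated with a minimal sketch. It assumes a Pearson chi-square test without continuity correction on a 2x2 table and interprets "frequency" as the proportion of patients who have the characteristic; the function names, table layout, and counts are hypothetical and are not taken from the study's code or data.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    Assumed layout (hypothetical):
                        malignant   benign
        characteristic      a          b
        no characteristic   c          d

    Returns (statistic, p_value).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def passes_screening(a, b, c, d, alpha=0.05, min_frequency=0.10):
    """Keep a binary feature only if p < alpha and its frequency > 10%."""
    n = a + b + c + d
    frequency = (a + b) / n  # share of patients with the characteristic
    _, p_value = chi_square_2x2(a, b, c, d)
    return p_value < alpha and frequency > min_frequency


# Hypothetical counts, for illustration only.
stat, p = chi_square_2x2(20, 10, 10, 20)
print(f"chi-square = {stat:.3f}, p = {p:.4f}")
print("carried forward:", passes_screening(20, 10, 10, 20))
```

Continuous variables would go through the t-test or Mann-Whitney test instead; only the thresholding logic is the same.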
In the present study, four different methods were selected to determine and compare their predictive accuracy in the diagnosis of endometrial cancer. The data set was divided into two parts, training data and test data, in a ratio of 4 to 1. In the training phase, the model tries to find the best parameters and weights for the function. The goal is the best classification with the lowest error rate on the test set. The following methods were implemented in the Python programming language with the Scikit-learn framework, an efficient tool for predictive data analysis.

Logistic Regression:
A logistic regression model predicts a binary dependent variable by analyzing its relationship with one or more independent features. It is widely used to predict several diseases, so it was included here. The goodness of fit of the final logistic regression model was assessed with the Hosmer-Lemeshow test.

Classification and Regression Trees:
Classification and regression trees were used to classify data into two categories (22). They can also identify characteristics and develop rules; hence, they are known as an explanatory method that is valuable for experts in medicine. The classification and regression trees model is represented as a binary tree in which each internal node, starting from the root, represents a single input variable (x) and a split point on that variable. The tree's leaf nodes contain an output variable (y), which is used to make a prediction.

Support Vector Machine:
Support vector machine is a supervised machine learning algorithm that analyzes data for classification and assigns them to categories. By finding separating hyper-planes in a high-dimensional feature space, the support vector machine aims to provide a computationally efficient way of learning (23). Numerous hyper-planes could be used to separate two sets of data. The hyper-plane with the largest margin should be picked as the best option.
The margin is the maximum width by which the boundary can be widened before it collides with a data point. The data points that the margin pushes up against are referred to as support vectors (figure 1-a). As a result, the support vector machine's aim is to determine the best hyper-plane for separating categories of target vectors on opposite sides of the plane (23).

Artificial Neural Network:
Artificial neural networks are computational networks inspired by biology. This computer modeling approach learns from examples through iterations without requiring prior knowledge of the relationships between process parameters. As a result, it can deal with uncertainty, noisy data, and non-linear correlations, unlike many traditional methods based on linear techniques. Artificial neural networks are well suited for classification and prediction tasks in practical circumstances because of their capacity to learn from a specific data set (24)(25)(26).
Among the numerous algorithms used in artificial neural networks, this study concentrated on the multilayer perceptron (24,25) with the back propagation learning algorithm. The multilayer perceptron is a supervised artificial neural network that has three types of layers: input, hidden, and output. It is the most often used artificial neural network for a wide range of problems. The multilayer perceptron is made up of a network of artificial neurons that are coupled so that the output of one neuron becomes the input of one or more other neurons. Specifically, an input layer of neurons receives the input data, followed by one or more hidden layers and finally an output layer that provides the network's output. A typical architecture of a multilayer perceptron model is shown in figure 1-b. The weighted input values to a single neuron are aggregated using a vector-to-scalar function such as summation (i.e., $z = \sum_i w_i x_i$, where $x_i$ are the inputs and $w_i$ their weights), averaging, input maximum, or mode value to produce a single input value to the neuron. After calculating this input value, the neuron applies an activation function to produce its output (and consequently the input signals for the next layer). The activation function transforms the input value of the neuron; a sigmoid, hyperbolic-tangent, or other nonlinear function is commonly used for this transformation. The structure of a single neuron is depicted in figure 1-c.
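The neuron computation described above, a weighted sum of inputs passed through a nonlinear activation, can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the study's Scikit-learn implementation: the function names, the bias term, and all weights and inputs are assumptions introduced for the example.

```python
import math


def sigmoid(z):
    """Logistic sigmoid activation, mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias=0.0, activation=sigmoid):
    """Aggregate weighted inputs by summation, then apply the activation.

    The bias term is an assumption of this sketch; the text describes
    only the weighted sum.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def mlp_forward(inputs, layers):
    """Forward pass through a multilayer perceptron.

    `layers` is a list of layers; each layer is a list of
    (weights, bias) pairs, one pair per neuron. The outputs of one
    layer become the inputs of the next.
    """
    activations = list(inputs)
    for layer in layers:
        activations = [neuron_output(activations, w, b) for w, b in layer]
    return activations


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
hidden_layer = [([0.5, -0.25], 0.0), ([0.0, 0.0], 0.0)]
output_layer = [([1.0, 1.0], -1.0)]
print(mlp_forward([1.0, 2.0], [hidden_layer, output_layer]))
```

Back propagation, which adjusts the weights from the prediction error, is not shown; the sketch covers only how a trained network turns inputs into an output.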

Results
Nine hundred seventy-two patients with a mean age of 45.77 ± 10.70 years (range 25-85 years) were recruited in this study, of whom 52 (5.3%) were diagnosed with endometrial cancer. A significant difference was observed between the two groups (benign and malignant) with regard to age (p<0.001), body mass index (p<0.001), menarche age (p<0.001), history of pregnancy (p=0.032), uterus size in bimanual exam (p=0.032), postmenopausal bleeding (p<0.001), menometrorrhagia (p=0.016), menorrhagia (p=0.002), metrorrhagia (p=0.003), history of cancer in patients (p<0.001), diabetes mellitus (p<0.001), hypertension (p<0.001), polycystic ovarian disease (p=0.001), hypothyroidism (p=0.009), menopause (p<0.001), menopause age (p=0.008), and endometrial thickness (p<0.001). Statistically significant differences were not observed between the two groups regarding family history of cancer, tamoxifen use, and infertility (table 1).

In the classification and regression trees model, the most important variable was menarche age. It was followed by endometrial thickness and patient age for those with menarche age younger than 11.5 years (category 1), and by endometrial thickness and hypertension for those with menarche age older than 11.5 years (category 2). Patients in category 1 had a higher risk of endometrial cancer if their endometrial thickness was greater than 16.5 mm and they were older than 51.1 years (probability 100%). Younger patients with an endometrial thickness of 23.5-51.1 mm were also more prone to endometrial cancer (probability 100%), while the risk was lower among younger patients whose endometrial thickness was in the range of 16.5-23.5 mm (probability 57.1%). Other rules were in favor of normal endometrium.
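The models in this study are compared on sensitivity, specificity, positive predictive value, and negative predictive value. As a reminder of how these four measures follow from a confusion matrix, here is a minimal sketch. The function name and the label vectors are hypothetical and are not study data; label 1 is assumed to mean malignant and 0 benign.

```python
def diagnostic_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, PPV and NPV for binary labels.

    Assumes 1 = malignant (positive) and 0 = benign (negative).
    A measure whose denominator is zero is returned as NaN.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # cancers correctly flagged
        "specificity": ratio(tn, tn + fp),  # benign cases correctly cleared
        "ppv": ratio(tp, tp + fp),          # flagged cases that are cancers
        "npv": ratio(tn, tn + fn),          # cleared cases that are benign
    }


# Hypothetical labels for ten patients, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
for name, value in diagnostic_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.3f}")
```

With a low prevalence such as the 5.3% seen here, the positive predictive value can stay well below the sensitivity and specificity even for a strong classifier, because the few false positives are drawn from a much larger benign group.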
The best performance of all methods was attributed to the logistic regression model (sensitivity of 100% and 98%, specificity of 98.83% and 98.7%, positive predictive value of 82.97% and 82.85%, and negative predictive value of 100% and 99.89% for the training and test data sets, respectively), followed by the classification and regression trees model (table 3).

Pergialiotis et al. (2018) reported that an artificial neural network had higher sensitivity and specificity than classification and regression trees and logistic regression for the prediction of endometrial cancer in postmenopausal women (9). Hutt et al. (2021) reported that neural network models were useful tools in the prediction and risk calculation of endometrial cancer (28). Artificial intelligence can also extract evidence about cancer from radiological images. A 2020 study by Dong et al. determined that artificial intelligence could help the radiologist interpret magnetic resonance images or be an acceptable alternative for assessing the depth of preoperative myometrial invasion in patients with stage one endometrial cancer (29). Despite these advantages, the use of artificial intelligence has limitations, including clinicians' perception of machine learning models, ethical challenges, and data collection (12).

Strengths and Weaknesses:
This study determined crucial risk factors for endometrial cancer using statistical approaches and then built models for predicting endometrial cancer using machine learning algorithms. In addition, these models can be used to decrease the probability of endometrial cancer by selecting at-risk patients and applying preventive strategies. We had limited access to the electronic records of other centers, which restricted the size of our data set. Moreover, it was difficult to find medical students or physicians familiar with artificial intelligence.
Implications for Practice and Future Research:
Multicenter studies could enlarge the data set, which may yield better results and improve the models' sensitivity. The artificial neural network, support vector machine, classification and regression trees, and logistic regression models used in this study had acceptable and similar overall accuracy and may help diagnose endometrial cancer with less invasive and less expensive methods. In the computer and digital era, physicians need to do more research on artificial intelligence and its contribution to medical science and to the diagnosis and treatment of diseases.